{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Calling the SHARE API\n",
    "----\n",
    "Here are some working examples of how to query the current scrAPI database for metrics of results coming through the SHARE Notifiation Service.\n",
    "\n",
    "*these slides and example notebooks*:  https://osf.io/bygau/\n",
    "\n",
    "slides also here: https://github.com/erinspace/share_tutorials"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Setup\n",
    "\n",
    "- All of the examples I use here will be using python and some basic libraries\n",
    "    - [Here's a basic guide to getting started with python](https://wiki.python.org/moin/BeginnersGuide)\n",
    "- Code and setup instructions on github at:\n",
    "     - https://github.com/erinspace/share_tutorials\n",
    "\n",
    "- To run these examples on your machine, you'll need to install some basic python packages\n",
    "    - Make sure to use a virtual enviornment to install python packages:\n",
    "        - https://virtualenv.readthedocs.org/en/latest/\n",
    "    - Using your terminal, run ```pip install -r requirements.txt``` inside your virtual enviornment\n",
    "- Run the Jupyter notebook server from the command line:\n",
    "    ```jupyter notebook```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Get a List of the Current SHARE Providers\n",
    "----\n",
    "We'll make an API call to find:\n",
    "- The official name of each SHARE Provider\n",
    "- The URL for the home page of each SHARE Provider\n",
    "- The shortname, or nickname of the SHARE provider for internal use\n",
    "    - We'll use this name when querying for documents from this source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "data = requests.get('https://osf.io/api/v1/share/providers/').json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### This is a lot of information!\n",
    "\n",
    "Let's display this in a way that looks a little nicer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from IPython.display import Image, display\n",
    "\n",
    "for source in data['providerMap'].keys():\n",
    "    display(Image(url=data['providerMap'][source]['favicon']))\n",
    "    print(\n",
    "        '{}\\n{}\\n{}\\n'.format(\n",
    "            data['providerMap'][source]['long_name'].encode('utf-8'),\n",
    "            data['providerMap'][source]['url'],\n",
    "            data['providerMap'][source]['short_name']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SHARE Schema\n",
    "\n",
    "Required fields:\n",
    "- title\n",
    "- contributors\n",
    "- uris\n",
    "- providerUpdatedDateTime\n",
    "\n",
    "We add some information after each document is harvested inside the field shareProperties, including:\n",
    "- source (where the document was originally harvested)\n",
    "- docID  (a unique identifier for that object from that source)\n",
    "\n",
    "These two fields can be combined to make a unique document identifier.\n",
    "\n",
    "See more details about the SHARE Schema, including examples of documents with all of the fields, here:\n",
    "https://osf.io/wunk7/wiki/home/?view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Simple Queries\n",
    "\n",
    "- We need a URL to use to access the SHARE API.\n",
    "- We will add arguments to this URL to shape our request\n",
    "    - size: how many results we'll return\n",
    "    - sort: how we want the results to be sorted\n",
    "    - from: where to start in the resutls returned"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "OSF_APP_URL = 'https://osf.io/api/v1/share/search/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import furl\n",
    "\n",
    "search_url = furl.furl(OSF_APP_URL)\n",
    "search_url.args['size'] = 3\n",
    "search_url.args['sort'] = 'providerUpdatedDateTime'\n",
    "search_url.args['from'] = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Our Query URL So far"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print('The request URL is {}'.format(search_url.url))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Our results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated on {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            datetime.strftime(datetime.strptime(result['providerUpdatedDateTime'], \"%Y-%m-%dT%H:%M:%S+00:00\"), '%B %d %Y')\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Narrowing Results by Source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "search_url.args['q'] = 'shareProperties.source:mit'\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "print('The request URL is {}'.format(search_url.url))\n",
    "print('---------')\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated on {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            datetime.strftime(datetime.strptime(result['providerUpdatedDateTime'], \"%Y-%m-%dT%H:%M:%S+00:00\"), '%B %d %Y')\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Complex Queries\n",
    "- The SHARE Search API runs on elasticsearch\n",
    "- More information on how to format elasticsearch queries: \n",
    "    - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Defining a Helper Function\n",
    "- We'll use this helper function in later examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def query_share(url, query):\n",
    "    # A helper function that will use the requests library,\n",
    "    # pass along the correct headers,\n",
    "    # and make the query we want\n",
    "    headers = {'Content-Type': 'application/json'}\n",
    "    data = json.dumps(query)\n",
    "    return requests.post(url, headers=headers, data=data, verify=False).json()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Building a Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "sponsorship_query = {\n",
    "    \"size\": 5,\n",
    "    \"query\": {\n",
    "        \"filtered\": {\n",
    "            \"filter\": {\n",
    "                \"exists\": {\n",
    "                    \"field\": \"sponsorships\"\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Running the Query and Printing Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = query_share(search_url.url, sponsorship_query)\n",
    "\n",
    "for item in results['results']:\n",
    "    print('{} -- from source {} -- sponsored by {}'.format(\n",
    "            item['title'].encode('utf-8'),\n",
    "            item['shareProperties']['source'].encode('utf-8'),\n",
    "            ' '.join(\n",
    "                [sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']]\n",
    "            )\n",
    "        )\n",
    "    )\n",
    "    print('-------------------')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### New Query\n",
    "\n",
    "How many results *do not* have subjects?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "no_subjects_query = {\n",
    "    \"query\": {\n",
    "        \"query_string\": {\n",
    "            \"analyze_wildcard\": True, \n",
    "            \"query\": \"NOT subjects:*\"\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results_with_no_subjects = query_share(search_url.url, no_subjects_query)\n",
    "total_results = requests.get(OSF_APP_URL).json()['count']\n",
    "results_percent = (float(results_with_no_subjects['count'])/total_results)*100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    '{} results out of {}, or {}%, do not have subjects.'.format(\n",
    "        results_with_no_subjects['count'],\n",
    "        total_results,\n",
    "        format(results_percent, '.2f')\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Using sharepa for SHARE Parsing and Analysis\n",
    "\n",
    "- sharepa - short for SHARE Parsing and Analysis\n",
    "    - https://github.com/CenterForOpenScience/sharepa#sharepa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Actions\n",
    "\n",
    "A basic search will provide access to all documents in SHARE in 10 document slices.\n",
    "\n",
    "#### Count\n",
    "You can use sharepa and the basic search to get the total number of documents in SHARE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sharepa import basic_search\n",
    "\n",
    "basic_search.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Iterating Through Results\n",
    "\n",
    "A basic iteration through results will yield 10 at a time, starting from the first documents collected.\n",
    "\n",
    "Let's do a basic search and iterate through the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = basic_search.execute()\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Slicing Results\n",
    "\n",
    "You can use slices to access a different set of results.\n",
    "\n",
    "Let's print out 5 results, starting from the 20th and going until the 25th."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = basic_search[20:25].execute()\n",
    "for hit in results:\n",
    "    print(hit.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Sorting Results\n",
    "\n",
    "By default, the oldest results are returned first.\n",
    "\n",
    "You can instead sort results by ```ProviderUpdatedDateTime``` to get the most recent items in the SHARE dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = basic_search.sort('-providerUpdatedDateTime').execute()\n",
    "\n",
    "for hit in results:\n",
    "    print('{} - Last updated on {}'.format(\n",
    "            hit.title.encode('utf-8'), \n",
    "            datetime.strftime(datetime.strptime(hit.providerUpdatedDateTime, \"%Y-%m-%dT%H:%M:%S+00:00\"), '%B %d %Y')\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Advanced Search with sharepa\n",
    "\n",
    "Queries are formed using lucene query syntax \n",
    "    - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sharepa import ShareSearch\n",
    "from sharepa.helpers import pretty_print\n",
    "\n",
    "my_search = ShareSearch()\n",
    "\n",
    "my_search = my_search.query(\n",
    "    'query_string',\n",
    "    query='subjects:*',\n",
    "    analyze_wildcard=True\n",
    ")\n",
    "\n",
    "pretty_print(my_search.to_dict())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "new_results = my_search.sort('-providerUpdatedDateTime').execute()\n",
    "\n",
    "for hit in new_results:\n",
    "    print(\n",
    "        '{} - with subjects {}\\n\\n'.format(\n",
    "            hit.title.encode('utf-8'),\n",
    "            [sub.encode('utf-8') for sub in hit.subjects]\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Debugging and Problem Solving\n",
    "\n",
    "Not everything always goes as planned when querying an unfamillar API.\n",
    "\n",
    "Here are some debugging and problem solving strategies when you're querying the SHARE API."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Start forming a search we're not too sure about\n",
    "\n",
    "We are interested in seeing how many results are specified as being in a language other than English"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "language_search = ShareSearch()\n",
    "\n",
    "language_search = language_search.query(\n",
    "    'query_string',\n",
    "    query='NOT languages=english'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = language_search.execute()\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### That didn't look right.\n",
    "\n",
    "Let's look at our error:\n",
    "\n",
    "```AttributeError: 'Result' object has no attribute 'languages' ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Building up the correct query\n",
    "\n",
    "Let's ry narrowing our query to only results that have a language attribute\n",
    "\n",
    "(Language is not required, so many results won't have this information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "language_search = ShareSearch()\n",
    "\n",
    "language_search = language_search.filter(\n",
    "    'exists',\n",
    "    field=\"languages\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results = language_search.execute()\n",
    "results_percent = (float(language_search.count())/basic_search.count())*100\n",
    "\n",
    "print('There are {}/{} - or {}% documents with languages specified'.format(\n",
    "        language_search.count(),\n",
    "        basic_search.count(),\n",
    "        format(results_percent, '.2f')\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print('Here are the languages for the first 10 results:')\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Referencing the SHARE Schema\n",
    "\n",
    "Simplified form here: https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml\n",
    "\n",
    "#### Section on languages:\n",
    "        languages:\n",
    "            description: |-\n",
    "                The primary languages in which the content of the resource is presented. Values used for this element MUST conform to ISO 639-3. This offers three letter tags e.g. \"eng\" for English.\n",
    "            type: array\n",
    "            items:\n",
    "                type: string\n",
    "                pattern: \"[a-z][a-z][a-z]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Continuing to Refine our Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from elasticsearch_dsl import Q\n",
    "\n",
    "language_search = language_search.query(~Q(\"term\", languages=\"eng\"))\n",
    "\n",
    "results = language_search.execute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    'There are {} documents that do not have \"eng\" listed.'.format(\n",
    "        language_search.count()\n",
    "    )\n",
    ")\n",
    "\n",
    "print('Here are the languages for the first 10 results:')\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Complex Queries and Basic Visualization\n",
    "\n",
    "- How to use both basic HTTP requests and sharepa\n",
    "- Aggregations, or queries that will return summary statistics about the whole dataset.\n",
    "- simple data visualizations using pandas and matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Aggregations\n",
    "\n",
    "Aggregations let you quickly get summary statistics for all of SHARE results in one query."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Documents Per Source Missing Descriptions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "missing_descriptions_aggregation = {\n",
    "    \"query\": {\n",
    "        \"query_string\": {\n",
    "            \"analyze_wildcard\": True, \n",
    "            \"query\": \"NOT description:*\"\n",
    "        }\n",
    "    },\n",
    "    \"aggs\": {\n",
    "        \"sources\": {\n",
    "            \"terms\": {\n",
    "                \"field\": \"_type\", # A field where the SHARE source is stored                \n",
    "                \"min_doc_count\": 0, \n",
    "                \"size\": 0  # Will return all sources, regardless if there are results\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "results_without_descriptions = query_share(OSF_APP_URL, missing_descriptions_aggregation)\n",
    "\n",
    "missing_descriptions_counts = results_without_descriptions['aggregations']['sources']['buckets']\n",
    "\n",
    "for source in missing_descriptions_counts:\n",
    "    print('{} has {} documents without descriptions'.format(source['key'], source['doc_count']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Making Results More Useful\n",
    "\n",
    "Let's do that same query, but this time find the percentages of documents from each source insead of the numbers alone.\n",
    "\n",
    "We'll also leave out sources that have all of their descriptions to make the list more manageable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "sig_terms_agg = {\n",
    "    \"query\": {\n",
    "        \"query_string\": {\n",
    "            \"analyze_wildcard\": True, \n",
    "            \"query\": \"NOT description:*\"\n",
    "        }\n",
    "    },\n",
    "    \"aggs\": {\n",
    "        \"sources\":{\n",
    "            \"significant_terms\":{\n",
    "                \"field\": \"_type\", # A field where the SHARE source is stored                \n",
    "                \"min_doc_count\": 1, # Only results with more than one document\n",
    "                \"percentage\": {} # This will make the \"score\" parameter a percentage\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "docs_with_no_description_results = query_share(OSF_APP_URL, sig_terms_agg)\n",
    "docs_with_no_description = docs_with_no_description_results['aggregations']['sources']['buckets']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "for source in docs_with_no_description:\n",
    "    print(\n",
    "        '{}% (or {}/{}) of documents from {} have no description'.format(\n",
    "            format(source['score']*100, '.2f'),\n",
    "            source['doc_count'],\n",
    "            source['bg_count'],\n",
    "            source['key']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Aggregations with sharepa\n",
    "\n",
    "Let's use sharepa to find out how many documents per source that do not have subjects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "no_subjects_search = ShareSearch()\n",
    "\n",
    "no_subjects_search = no_subjects_search.query(\n",
    "    'query_string',\n",
    "    query='NOT subjects:*',\n",
    "    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)\n",
    ")\n",
    "\n",
    "no_subjects_search.aggs.bucket(\n",
    "    'sources',  # Every aggregation needs a name\n",
    "    'significant_terms',  # There are many kinds of aggregations\n",
    "    field='_type',  # We store the source of a document in its type, so this will aggregate by source\n",
    "    min_doc_count=1,\n",
    "    percentage={},\n",
    "    size=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Examining the query\n",
    "\n",
    "Let's take a look at the query that sharepa generated, and we'll see that it looks a lot like the query we made by hand"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "pretty_print(no_subjects_search.to_dict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Executing the query\n",
    "\n",
    "Run the query and check out the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "aggregated_results = no_subjects_search.execute()\n",
    "\n",
    "for source in aggregated_results.aggregations['sources']['buckets']:\n",
    "    print(\n",
    "        '{}% of documents from {} do not have subjects'.format(\n",
    "            format(source['score']*100, '.2f'),\n",
    "            source['key'] \n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Top Subjects Aggregation\n",
    "\n",
    "Let's do an elasticsearch query to find out what the most used subjects are used in the dataset across all sources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "top_subjects_search = ShareSearch()\n",
    "\n",
    "top_subjects_search.aggs.bucket(\n",
    "    'subjectsTermFilter',  # Every aggregation needs a name\n",
    "    'terms',  # There are many kinds of aggregations\n",
    "    field='subjects',  # We store the source of a document in its type, so this will aggregate by source\n",
    "    min_doc_count=1,\n",
    "    exclude= \"of|and|or\",\n",
    "    size=10\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "top_subjects_results_executed = top_subjects_search.execute()\n",
    "top_subjects_results = top_subjects_results_executed.aggregations.subjectsTermFilter.to_dict()['buckets']\n",
    "\n",
    "pretty_print(top_subjects_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Plotting\n",
    "\n",
    "Here are some simple plots using pandas and matplotlib\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Creating a Dataframe\n",
    "\n",
    "To create a plot, first we need to get the data into an appropriae format.\n",
    "\n",
    "Pandas, a python plotting library, has the DataFrame format, which is a lot like a spreadsheet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "top_subjects_dataframe = pd.DataFrame(top_subjects_results)\n",
    "top_subjects_dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Plotting the Dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from matplotlib import pyplot\n",
    "%matplotlib inline\n",
    "\n",
    "top_subjects_dataframe.plot(kind='bar', x='key', y='doc_count')\n",
    "pyplot.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Complex Queries and Dataframes\n",
    "\n",
    "Let's make a new search, for all documents updated in the years 2012 to 2015 that contain the subject \"science.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "science_search = ShareSearch() #create search object\n",
    "science_search = science_search.filter( #apply filter to search\n",
    "    \"range\", #applied a range type filter\n",
    "    providerUpdatedDateTime={ #the feild in the data we compare\n",
    "        'gte':'2012-01-01', #hits must be greater than or equal to this date and...\n",
    "        'lte':'2015-12-31' #hits must be less than or equal to this date\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "science_search = science_search.filter(\n",
    "     \"prefix\",\n",
    "     subjects=\"science\"\n",
    ")\n",
    "\n",
    "science_search.aggs.bucket(\n",
    "    'sources',\n",
    "    'significant_terms',\n",
    "    field='_type',\n",
    "    min_doc_count=1,\n",
    "    percentage={},\n",
    "    size=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Take a look at the query we've built"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "pretty_print(science_search.to_dict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Make the query, and graph the result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from matplotlib import pyplot\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "science_search_results = science_search.execute()\n",
    "\n",
    "science_results = science_search_results.aggregations.sources.to_dict()  \n",
    "science_data_frame = pd.DataFrame(science_results['buckets']) \n",
    "\n",
    "science_data_frame['percents'] = (science_data_frame['score'] * 100)\n",
    "\n",
    "science_data_frame[:30].plot(kind='bar', x='key', y='percents') # Limit to the first 30 results for readability\n",
    "\n",
    "pyplot.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Plot Number of Documents by Source\n",
    "\n",
    "We'll limit it to the top 30 sources to make sure that the graph is readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sharepa import bucket_to_dataframe\n",
    "\n",
    "all_results = ShareSearch()\n",
    "\n",
    "all_results = all_results.query(\n",
    "    'query_string',\n",
    "    query='*',\n",
    "    analyze_wildcard=True\n",
    ")\n",
    "\n",
    "all_results.aggs.bucket(\n",
    "    'sources',\n",
    "    'terms',\n",
    "    field='_type',\n",
    "    size=0,\n",
    "    min_doc_count=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "all_results = all_results.execute()\n",
    "\n",
    "all_results_frame = bucket_to_dataframe(\n",
    "    '# documents by source',\n",
    "    all_results.aggregations.sources.buckets\n",
    ")\n",
    "\n",
    "all_results_frame_sorted = all_results_frame.sort(\n",
    "    ascending=False,\n",
    "    columns='# documents by source'\n",
    ")\n",
    "\n",
    "all_results_frame_sorted[:30].plot(kind='bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Different Kinds of Charts\n",
    "\n",
    "Let's make a pie chart\n",
    "\n",
    "- Limited to 10 sources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "all_results_frame_sorted[:10].plot(kind='pie', y=\"# documents by source\", legend=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SHARE Data in the Wide World\n",
    "\n",
    "Here are some examples of how to get SHARE data into different formats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Exporting a DataFrame to csv and Excel\n",
    "\n",
    "Let's do a query and then export the results to different formats.\n",
    "\n",
    "We're interested in the number of documents from each source that have a description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "description_search = ShareSearch()\n",
    "\n",
    "description_search = description_search.query(\n",
    "    'query_string', \n",
    "    query='description:*',\n",
    "    analyze_wildcard=True\n",
    ")\n",
    "\n",
    "description_search.aggs.bucket(\n",
    "    'sources',\n",
    "    'significant_terms',\n",
    "    field='_type',\n",
    "    min_doc_count=0,\n",
    "    percentage={},\n",
    "    size=0\n",
    ")\n",
    "\n",
    "description_results = description_search.execute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Cleaning up our dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])\n",
    "\n",
    "# We will add our own \"percent\" column to make things clearer\n",
    "description_dataframe['percent'] = (description_dataframe['score'] * 100)\n",
    "# And, drop the old score column\n",
    "description_dataframe = description_dataframe.drop('score', 1)\n",
    "\n",
    "# Let's set the source name as the index, and then drop the old column\n",
    "description_dataframe = description_dataframe.set_index(description_dataframe['key'])\n",
    "description_dataframe = description_dataframe.drop('key', 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Finally, we'll show the results!\n",
    "description_dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Exporting to CSV and Excel formats\n",
    "\n",
    "Pandas has handy tools built in that makes converting a dataframe very easy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "description_dataframe.to_csv('exported_data/SHARE_Counts_with_Descriptions.csv')\n",
    "description_dataframe.to_excel('exported_data/SHARE_Counts_with_Descriptions.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Working with Outside Data\n",
    "\n",
    "Here's a quick example of how you could work with a list of names, and use them to see what information is in SHARE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "names = [\"Susan Jones\", \"Ravi Patel\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "name_search = ShareSearch()\n",
    "\n",
    "for name in names:\n",
    "    name_search = name_search.query(\n",
    "        {\n",
    "            \"bool\": {\n",
    "                \"should\": [\n",
    "                    {\n",
    "                        \"match\": {\n",
    "                            \"contributors.name\": {\n",
    "                                \"query\": name, \n",
    "                                \"operator\": \"and\",\n",
    "                                \"type\" : \"phrase\"\n",
    "                            }\n",
    "                        }\n",
    "                    }\n",
    "                ]\n",
    "            }\n",
    "        }\n",
    "    )\n",
    "\n",
    "\n",
    "name_results = name_search.execute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    'There are {} documents with contributors who have any of those names.'.format(\n",
    "        name_search.count()\n",
    "    )\n",
    ")\n",
    "\n",
    "print('Here are the first 10:')\n",
    "print('---------')\n",
    "for result in name_results:\n",
    "    print(\n",
    "        '{} -- with contributors {}'.format(\n",
    "            result.title.encode('utf-8'),\n",
    "            ', '.join([contributor.name.encode('utf-8') for contributor in result.contributors])\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Where did these results come from?\n",
    "\n",
    "We can add an aggregation!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "name_search.aggs.bucket(\n",
    "    'sources',  # Every aggregation needs a name\n",
    "    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though\n",
    "    field='_type',  # We store the source of a document in its type, so this will aggregate by source\n",
    "    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs\n",
    "    min_doc_count=1\n",
    ")\n",
    "\n",
    "name_results = name_search.execute()\n",
    "\n",
    "pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Searching by ORCID "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "orcids = [\n",
    "    'http://orcid.org/0000-0003-1942-4543',\n",
    "    'http://orcid.org/0000-0003-4875-1447',\n",
    "    'http://orcid.org/0000-0002-6085-4433',\n",
    "    'http://orcid.org/0000-0002-7995-9948',\n",
    "    'http://orcid.org/0000-0002-2170-853X',\n",
    "    'http://orcid.org/0000-0002-8899-9087'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "orcid_search = ShareSearch()\n",
    "\n",
    "for orcid in orcids:\n",
    "    orcid_search = orcid_search.query(\n",
    "        {\n",
    "            \"bool\": {\n",
    "                \"should\": [\n",
    "                    {\n",
    "                        \"match\": {\n",
    "                            \"contributors.sameAs\": {\n",
    "                                \"query\": orcid, \n",
    "                                \"operator\": \"and\",\n",
    "                                \"type\" : \"phrase\"\n",
    "                            }\n",
    "                        }\n",
    "                    }\n",
    "                ]\n",
    "            }\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "orcid_search.aggs.bucket(\n",
    "    'sources',\n",
    "    'terms',\n",
    "    field='_type',\n",
    "    size=0,\n",
    "    min_doc_count=1\n",
    ")\n",
    "\n",
    "orcid_results = orcid_search.execute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    'There are {} documents with contributors who have any of those orcids.'.format(\n",
    "        orcid_search.count()\n",
    "    )\n",
    ")\n",
    "\n",
    "all_agg_df = pd.DataFrame()\n",
    "all_agg_df['title'] = [result.title for result in orcid_results]\n",
    "all_agg_df['docID'] = [result.shareProperties.docID for result in orcid_results]\n",
    "all_agg_df['source'] = [result.shareProperties.source for result in orcid_results]\n",
    "all_agg_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## This is just the surface!\n",
    "\n",
    "The SHARE API has the potential to answer many questions about our data\n",
    "\n",
    "Data curation and enhancement will only make these analasyes more interesting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Thank you!\n",
    "\n",
    "## Questions?\n",
    "\n",
    "**email**: erin@cos.io\n",
    "\n",
    "*SHARE Technical Documentation and Information*: https://osf.io/t3j94/\n",
    "\n",
    "*these slides and example notebooks*:  https://osf.io/bygau/"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
